A cache speeds up the system by reducing latency between the fast and slow medium. The applications of caching are wildly used at different levels in computing system, such as CPU Cache, GPU Cache, Disk Cache, and Web Cache. A cache has limited size and needs an algorithm to decide which elements to replace when it is full. The embedded replacement algorithm is referred to caching policy, or cache replacement policy. When there is request to fetch from the slow medium, the requested element could have existed in the cache. In this scenario, a cache hit is observed. The performance of a caching policy is measured by the cache hit ratio. Many policies have robust performance under some workloads but not the others. This report is to explore the characteristics of different workloads of a CPU cache, and predict the best caching policy to use given the characteristics.
This report focuses on 5 caching policies.
Belady decides the victim to eject by selecting the element with longest reuse distance in the future. Such policy is normally impossible to implement on the real system.
LRU uses the concept of aging. The policy will eject the oldest element when the cache is full.
MRU is the opposite of LRU. It also updates the ages, but the youngest element will be ejected.
We purpose an idea to find the victim based on the pivot. We cache the address of previous access and set it as the pivot. When the new element comes and need to be inserted into the cache, we check the relative direction to the pivot. If it’s on the left side of the pivot, we eject the right most element. Otherwise if it’s on the right side of the pivot, we eject the left most element.
We purpose an idea to find the victim by simply choosing the element with the longest distance. If we use address to calculate the distance, the ejected element would be either at the left or right end.
We collect the workload using Intel Pin to trace memory access for target programs. We implemented a Pin tool which logs the data access in binary form. To observe more representative memory behavior, we use SPEC CPU 2017 benchmarks.
This benchmark uses a cut-down version of Perl v5.22.1. The trace file produced by our PIN tool has a big volume (12 GB). Thus we sample random chunks of the trace file to measure the performance of the caching policies. The number under each policy represents number of cache hits. For each row, all policies operate on the same chunk.
dat.500.perlbench_r <- read.csv('../data/500.perlbench_r.csv')
dat.500.perlbench_rhist(dat.500.perlbench_r$data_size, main='Chunk Size Distribution',
xlab='Chunk Size (number of addresses)', col='deepskyblue')The chunk size is sampled from a uniform distribution. The maximum chunk size is 10k addresses. To make meaningful simulation, we limit the cache size to the number of unique addresses in the chunk. Thus the distribution for the cache sizes are right skewed.
The program implements “Lattice Boltzmann Method” (LBM) to simulate incompressible fluids in 3D. The trace file produced by our PIN tool has a big volume (32 GB). Thus we sample random chunks of the trace file to measure the performance of the caching policies. The number under each policy represents number of cache hits. For each row, all policies operate on the same chunk.
dat.519.lbm_r <- read.csv('../data/519.lbm_r.csv')
dat.519.lbm_rThe program encodes video streams into H.264/MPEG-4 AVC format. The trace file produced by our PIN tool has a big volume (32 GB). Thus we sample random chunks of the trace file to measure the performance of the caching policies. The number under each policy represents number of cache hits. For each row, all policies operate on the same chunk.
dat.525.x264_r <- read.csv('../data/525.x264_r.csv')
dat.525.x264_r531.deepsjeng_r is based on Deep Sjeng WC2008, the 2008 World Computer Speed-Chess Champion. Deep Sjeng is a rewrite of the older Sjeng-Free program, focused on obtaining the highest possible playing strength. The trace file produced by our PIN tool has a big volume (67 GB). Thus we sample random chunks of the trace file to measure the performance of the caching policies. The number under each policy represents number of cache hits. For each row, all policies operate on the same chunk.
dat.531.deepsjeng_r <- read.csv('../data/531.deepsjeng_r.csv')
dat.531.deepsjeng_rThis program calculates Fibonacci recursively without memoization. The trace file produced by our PIN tool has a size of 34 MB. Thus we sample random chunks of the trace file to measure the performance of the caching policies. The number under each policy represents number of cache hits. For each row, all policies operate on the same chunk.
dat.fib <- read.csv('../data/fib.csv')
dat.fibThere are 9 metrics associated with each chuck of the workload.
unique_size: Number of unique addresses appeared in the chunk.reads: Number of memory reads.writes: Number of memory writes.avg_step: Average step size. The step size is the absolute difference between two addresses.avg_reuse_dist: Average distance to the next occurrence of the same address.sum_sqrt_count: Sum of the square root of the frequency for each unique address.sum_count: Sum of the of the frequency for each unique address.lis: The length of the longest increasing subsequence.lds: The length of the decreasing increasing subsequence.add_ratio <- function(dat) {
dat$unique_ratio <- dat$unique_size / dat$data_size
dat$read_ratio <- dat$reads / dat$data_size
dat$write_ratio <- dat$writes / dat$data_size
dat$sum_sqrt_count_ratio <- dat$sum_sqrt_count / dat$data_size
dat$sum_count_ratio <- dat$sum_count / dat$data_size
dat$lis_ratio <- dat$lis / dat$data_size
dat$lds_ratio <- dat$lds / dat$data_size
dat$cache_ratio <- dat$cache_size / dat$data_size
dat$belady_ratio <- (dat$belady + dat$cache_size) / dat$data_size
dat$lru_ratio <- (dat$lru + dat$cache_size) / dat$data_size
dat$pivot_ratio <- (dat$pivot + dat$cache_size) / dat$data_size
dat$ld_ratio <- (dat$ld + dat$cache_size) / dat$data_size
dat$mru_ratio <- (dat$mru + dat$cache_size) / dat$data_size
return(dat)
}
dat.500.perlbench_r <- add_ratio(dat.500.perlbench_r)
dat.519.lbm_r <- add_ratio(dat.519.lbm_r)
dat.525.x264_r <- add_ratio(dat.525.x264_r)
dat.531.deepsjeng_r <- add_ratio(dat.531.deepsjeng_r)
dat.fib <- add_ratio(dat.fib)
D <- list(dat.500.perlbench_r, dat.519.lbm_r, dat.525.x264_r, dat.fib, dat.531.deepsjeng_r)
W <- c('Perl', 'LBM', 'Video compression', 'Fibonacci', 'Pattern Recognition')
plot.hist.metric <- function(D, W, metric) {
par(mfrow=c(3, 2))
for (i in 1:length(D)) {
d <- D[[i]][, metric]
q <- quantile(d, 0.95)
hist(d[d < q], xlab=metric, main=W[i], col='deepskyblue')
}
}plot.hist.metric(D, W, 'unique_ratio')Unique ratio represents the percentage of unique addresses in the sampled chuck. A lower unique ratio indicates most of the addresses in the chuck are the same. In general, most sampled chucks don’t have many different addresses. Thus a higher unique ratio would be less common, and the frequency decreases. Note the unique ratio could be a multimodal distribution, as some of the plots implied.
plot.hist.metric(D, W, 'read_ratio')Read ratio represents the percentage of the observations involve memory read. Read operations tend to occur in big consecutive chucks. Thus sampled chucks are more likely consist of all reads or none reads.
plot.hist.metric(D, W, 'write_ratio')Write ratio represents the percentage of the observations involve memory write. Write operations are distributed more evenly comparing to memory read operations. Most sampled chucks have low write ratio. It’s less likely to observe a sampled chuck purely consists of memory writes.
plot.hist.metric(D, W, 'avg_step')plot.hist.metric(D, W, 'avg_reuse_dist')plot.hist.metric(D, W, 'sum_sqrt_count_ratio')plot.hist.metric(D, W, 'lis_ratio')plot.hist.metric(D, W, 'lds_ratio')Additionally, the performance of each caching policy is also attached to each chuck. The cache size is specified by cache_size. To create meaningful observations, the cache_size is chosen uniform randomly between 1 and the number of unique addresses (unique_size). The performance of a caching policy is defined by its hit ratio, which is calculated by (H + S) / C, where H is the number of hits, S is the cache size, and C is the size of the chuck.
plot.policy.hit_ratio <- function(dat, workload) {
col <- rgb(red=0, green=0, blue=0, alpha=0.2)
par(mfrow=c(2, 3), pty='s')
plot(belady_ratio ~ cache_ratio, dat, pch=19, cex=0.5, col=col, main=paste(workload, '- Belady'))
plot(lru_ratio ~ cache_ratio, dat, pch=19, cex=0.5, col=col, main=paste(workload, '- LRU'))
plot(pivot_ratio ~ cache_ratio, dat, pch=19, cex=0.5, col=col, main=paste(workload, '- Pivot'))
plot(ld_ratio ~ cache_ratio, dat, pch=19, cex=0.5, col=col, main=paste(workload, '- LD'))
plot(mru_ratio ~ cache_ratio, dat, pch=19, cex=0.5, col=col, main=paste(workload, '- MRU'))
}
plot.policy.hit_ratio(dat.500.perlbench_r, 'Perl')The hit ratio distribution of LRU is closer to the optimal policy - Belady. This plot suggest LRU is more likely to be the best caching policy for this workload.
plot.policy.hit_ratio(dat.fib, 'Fibonacci')